The data set analyzed can be obtained from the Kaggle platform. https://www.kaggle.com/silicon99/dft-accident-data
This dataset comes from the UK police forces who collect data on every vehicle collision in the UK from 2005 to 2015.In this project only one csv file named Accidents0515.csv which includes part of information is conisdered.
This project consists of four parts.After preparation, part two is a warmup part, it will focus on the frequency of accidents during the year and the characteristics to start the anaysis. In the third part, it will select three factors to investigate in order to get what influence these factors have on the accident occurence and the accident severtity. Finally, a conclusion will be given
library(ggplot2)
library(dplyr)
library(gridExtra)
library(hexbin)
library(party)
library(rpart)
library(rpart.plot)
library(tidyverse)
library(skimr)
library(DataExplorer)
library(caret)
library(pROC)
library(tidyverse)
library(randomForest)
library(e1071)
library(fpc)
accident <-read.csv("/Users/ame/Documents/R File/Accidents0515.csv")
Analyze the data with str( ):
str(accident)
'data.frame': 1780653 obs. of 32 variables:
$ Accident_Index : chr "200501BS00001" "200501BS00002" "200501BS00003" "200501BS00004" ...
$ Location_Easting_OSGR : int 525680 524170 524520 526900 528060 524770 524220 525890 527350 524550 ...
$ Location_Northing_OSGR : int 178240 181650 182240 177530 179040 181160 180830 179710 177650 180810 ...
$ Longitude : num -0.191 -0.212 -0.206 -0.174 -0.157 ...
$ Latitude : num 51.5 51.5 51.5 51.5 51.5 ...
$ Police_Force : int 1 1 1 1 1 1 1 1 1 1 ...
$ Accident_Severity : int 2 3 3 3 3 3 3 3 3 3 ...
$ Number_of_Vehicles : int 1 1 2 1 1 2 2 1 2 2 ...
$ Number_of_Casualties : int 1 1 1 1 1 1 1 2 2 5 ...
$ Date : chr "04/01/2005" "05/01/2005" "06/01/2005" "07/01/2005" ...
$ Day_of_Week : int 3 4 5 6 2 3 5 6 7 7 ...
$ Time : chr "17:42" "17:36" "00:15" "10:35" ...
$ Local_Authority_.District. : int 12 12 12 12 12 12 12 12 12 12 ...
$ Local_Authority_.Highway. : chr "E09000020" "E09000020" "E09000020" "E09000020" ...
$ X1st_Road_Class : int 3 4 5 3 6 6 5 3 3 4 ...
$ X1st_Road_Number : int 3218 450 0 3220 0 0 0 315 3212 450 ...
$ Road_Type : int 6 3 6 6 6 6 6 3 6 6 ...
$ Speed_limit : int 30 30 30 30 30 30 30 30 30 30 ...
$ Junction_Detail : int 0 6 0 0 0 0 3 0 6 3 ...
$ Junction_Control : int -1 2 -1 -1 -1 -1 4 -1 2 4 ...
$ X2nd_Road_Class : int -1 5 -1 -1 -1 -1 6 -1 4 5 ...
$ X2nd_Road_Number : int 0 0 0 0 0 0 0 0 304 0 ...
$ Pedestrian_Crossing.Human_Control : int 0 0 0 0 0 0 0 0 0 0 ...
$ Pedestrian_Crossing.Physical_Facilities : int 1 5 0 0 0 0 0 0 5 8 ...
$ Light_Conditions : int 1 4 4 1 7 1 4 1 4 1 ...
$ Weather_Conditions : int 2 1 1 1 1 2 1 1 1 1 ...
$ Road_Surface_Conditions : int 2 1 1 1 2 2 1 1 1 1 ...
$ Special_Conditions_at_Site : int 0 0 0 0 0 6 0 0 0 0 ...
$ Carriageway_Hazards : int 0 0 0 0 0 0 0 0 0 0 ...
$ Urban_or_Rural_Area : int 1 1 1 1 1 1 1 1 1 1 ...
$ Did_Police_Officer_Attend_Scene_of_Accident: int 1 1 1 1 1 1 1 1 1 1 ...
$ LSOA_of_Accident_Location : chr "E01002849" "E01002909" "E01002857" "E01002840" ...
There are 1780653 obs and 32 variables consisting of 5 char, 25 int and 2 numeric.
Analyze the data with summary( ):
summary(accident)
Accident_Index Location_Easting_OSGR Location_Northing_OSGR Longitude Latitude Police_Force Accident_Severity Number_of_Vehicles Number_of_Casualties
Length:1780653 Min. : 64950 Min. : 10290 Min. :-7.5162 Min. :49.91 Min. : 1.00 Min. :1.000 Min. : 1.000 Min. : 1.000
Class :character 1st Qu.:376400 1st Qu.: 177990 1st Qu.:-2.3548 1st Qu.:51.49 1st Qu.: 7.00 1st Qu.:3.000 1st Qu.: 1.000 1st Qu.: 1.000
Mode :character Median :441320 Median : 264950 Median :-1.3865 Median :52.27 Median :31.00 Median :3.000 Median : 2.000 Median : 1.000
Mean :440180 Mean : 298513 Mean :-1.4286 Mean :52.57 Mean :30.75 Mean :2.838 Mean : 1.832 Mean : 1.349
3rd Qu.:523424 3rd Qu.: 396570 3rd Qu.:-0.2165 3rd Qu.:53.46 3rd Qu.:46.00 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.: 1.000
Max. :655540 Max. :1208800 Max. : 1.7620 Max. :60.76 Max. :98.00 Max. :3.000 Max. :67.000 Max. :93.000
NA's :138 NA's :138 NA's :138 NA's :138
Date Day_of_Week Time Local_Authority_.District. Local_Authority_.Highway. X1st_Road_Class X1st_Road_Number Road_Type Speed_limit
Length:1780653 Min. :1.000 Length:1780653 Min. : 1.0 Length:1780653 Min. :1.000 Min. : -1 Min. :1.000 Min. : 0.00
Class :character 1st Qu.:2.000 Class :character 1st Qu.:122.0 Class :character 1st Qu.:3.000 1st Qu.: 0 1st Qu.:6.000 1st Qu.:30.00
Mode :character Median :4.000 Mode :character Median :328.0 Mode :character Median :4.000 Median : 128 Median :6.000 Median :30.00
Mean :4.115 Mean :353.3 Mean :4.091 Mean :1008 Mean :5.167 Mean :39.03
3rd Qu.:6.000 3rd Qu.:531.0 3rd Qu.:6.000 3rd Qu.: 725 3rd Qu.:6.000 3rd Qu.:50.00
Max. :7.000 Max. :941.0 Max. :6.000 Max. :9999 Max. :9.000 Max. :70.00
Junction_Detail Junction_Control X2nd_Road_Class X2nd_Road_Number Pedestrian_Crossing.Human_Control Pedestrian_Crossing.Physical_Facilities Light_Conditions Weather_Conditions
Min. :-1.000 Min. :-1.00 Min. :-1.000 Min. : -1.0 Min. :-1.000000 Min. :-1.0000 Min. :1.000 Min. :-1.000
1st Qu.: 0.000 1st Qu.:-1.00 1st Qu.:-1.000 1st Qu.: 0.0 1st Qu.: 0.000000 1st Qu.: 0.0000 1st Qu.:1.000 1st Qu.: 1.000
Median : 2.000 Median : 2.00 Median : 3.000 Median : 0.0 Median : 0.000000 Median : 0.0000 Median :1.000 Median : 1.000
Mean : 2.333 Mean : 1.81 Mean : 2.663 Mean : 378.3 Mean : 0.009261 Mean : 0.7375 Mean :1.951 Mean : 1.576
3rd Qu.: 3.000 3rd Qu.: 4.00 3rd Qu.: 6.000 3rd Qu.: 0.0 3rd Qu.: 0.000000 3rd Qu.: 0.0000 3rd Qu.:4.000 3rd Qu.: 1.000
Max. : 9.000 Max. : 4.00 Max. : 6.000 Max. :9999.0 Max. : 2.000000 Max. : 8.0000 Max. :7.000 Max. : 9.000
Road_Surface_Conditions Special_Conditions_at_Site Carriageway_Hazards Urban_or_Rural_Area Did_Police_Officer_Attend_Scene_of_Accident LSOA_of_Accident_Location
Min. :-1.000 Min. :-1.0000 Min. :-1.00000 Min. :1.000 Min. :-1.000 Length:1780653
1st Qu.: 1.000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.:1.000 1st Qu.: 1.000 Class :character
Median : 1.000 Median : 0.0000 Median : 0.00000 Median :1.000 Median : 1.000 Mode :character
Mean : 1.358 Mean : 0.1091 Mean : 0.07219 Mean :1.356 Mean : 1.193
3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.:2.000 3rd Qu.: 1.000
Max. : 5.000 Max. : 7.0000 Max. : 7.00000 Max. :3.000 Max. : 3.000
The amount of data is very large.
Analyze the data with head( ):
head(accident)
Count the number of NA for each classification:
apply(is.na(accident), 2, sum)
Accident_Index Location_Easting_OSGR Location_Northing_OSGR Longitude
0 138 138 138
Latitude Police_Force Accident_Severity Number_of_Vehicles
138 0 0 0
Number_of_Casualties Date Day_of_Week Time
0 0 0 0
Local_Authority_.District. Local_Authority_.Highway. X1st_Road_Class X1st_Road_Number
0 0 0 0
Road_Type Speed_limit Junction_Detail Junction_Control
0 0 0 0
X2nd_Road_Class X2nd_Road_Number Pedestrian_Crossing.Human_Control Pedestrian_Crossing.Physical_Facilities
0 0 0 0
Light_Conditions Weather_Conditions Road_Surface_Conditions Special_Conditions_at_Site
0 0 0 0
Carriageway_Hazards Urban_or_Rural_Area Did_Police_Officer_Attend_Scene_of_Accident LSOA_of_Accident_Location
0 0 0 0
Numerical and integer variables are selected as the independent variables of the research object: We first use the decision tree model for a general browse, and then determine the specific multi classification. After that, we use other models respectively. We classify the following steps as data preprocessing. #### (4)Select the appropriate dependent variable
#1.Accident_Severity
#2.Number_of_Vehicles
#3.Number_of_Casualties
#4.Road_Type
#5.Speed_limit
#6.Junction_Detail
#7.Junction_Control
#8.Pedestrian_Crossing.Human_Control
#9.Pedestrian_Crossing
#10.Light_Conditions
#11.Weather_Conditions
#12.Road_Surface_Conditions
#13.Special_Conditions_at_Site
#14.Carriageway_Hazards
#15.Urban_or_Rural_Area
data1<- subset(accident,select=c(Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Road_Type,Speed_limit,Junction_Detail,Junction_Control,Pedestrian_Crossing.Human_Control,Pedestrian_Crossing.Human_Control,Pedestrian_Crossing.Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area))
data1 <- na.omit(data1)
plot_missing(data1) #Show missing values
data1$Accident_Severity<- factor(data1$Accident_Severity) #Convert numeric category to factor category
skim(data1) #Skim data after data processing
── Data Summary ────────────────────────
Values
Name data1
Number of rows 1780653
Number of columns 16
_______________________
Column type frequency:
factor 1
numeric 15
________________________
Group variables None
table(data1$Accident_Severity) #Distribution of dependent variables
1 2 3
22998 242080 1515575
data1$Accident_Severity <- fct_collapse(data1$Accident_Severity,"1"=c("2"),"2"=c("3")) #Merge factor
data1 #After data processing
set.seed(35)
#The set. Seed() function is used to ensure that the random numbers you randomly generate are consistent
trains <- createDataPartition(
y= data1$Accident_Severity,
p=0.7,
list= F
)
traindata <- data1[trains,] #70% training data
testdata <- data1[-trains,] #30% testing data
#Distribution of dependent variables after splitting
table(traindata$Accident_Severity)
1 2
185555 1060903
table(testdata$Accident_Severity)
1 2
79523 454672
colnames(data1)
[1] "Accident_Severity" "Number_of_Vehicles" "Number_of_Casualties" "Road_Type"
[5] "Speed_limit" "Junction_Detail" "Junction_Control" "Pedestrian_Crossing.Human_Control"
[9] "Pedestrian_Crossing.Human_Control.1" "Pedestrian_Crossing.Physical_Facilities" "Light_Conditions" "Weather_Conditions"
[13] "Road_Surface_Conditions" "Special_Conditions_at_Site" "Carriageway_Hazards" "Urban_or_Rural_Area"
#Dependent variable and independent variable construction formula:
form_clsm <- as.formula(
paste0(
"Accident_Severity ~ ",
paste(colnames(traindata)[2:16],collapse = " + ")
)
)
#Building decision tree model
set.seed(35)
fit_dt_clsm <- rpart(
form_clsm,
data = traindata,
method = "anova",
parms = list(split = "information"),
control = rpart.control(cp=0.005)
)
#Complexity related data
printcp(fit_dt_clsm)
Regression tree:
rpart(formula = form_clsm, data = traindata, method = "anova",
parms = list(split = "information"), control = rpart.control(cp = 0.005))
Variables actually used in tree construction:
[1] Number_of_Vehicles Speed_limit
Root node error: 157932/1246458 = 0.1267
n= 1246458
CP nsplit rel error xerror xstd
1 0.0151152 0 1.00000 1.00000 0.0017671
2 0.0056297 1 0.98488 0.98489 0.0017310
3 0.0050000 2 0.97926 0.97926 0.0017135
plotcp(fit_dt_clsm)
#post-pruning
fit_dt_clsm_pruned <- prune(fit_dt_clsm, cp = 0.0016)
#Variable importance
fit_dt_clsm_pruned$variable.importance
Number_of_Vehicles Speed_limit Urban_or_Rural_Area Light_Conditions Carriageway_Hazards Special_Conditions_at_Site
2392.738099 889.172511 365.852066 70.213258 23.658808 10.332987
Road_Surface_Conditions
2.830593
varimpdata <- data.frame(importance = fit_dt_clsm_pruned$variable.importance)
varimpdata
#Drawing
ggplot(varimpdata,aes(x = as_factor(rownames(varimpdata)),y= importance))+geom_col() + labs(x= "variables")+ theme(axis.text.x = element_text(angle = 15,hjust = 1))+ theme_classic()
We select the first four factors that most affect the dependent variables as the research objects of the next three models, and up to now, all data preprocessing has been completed. ### [2]Classification #### (1)Decision_tree Model We chose accident_ Severity,Number_ of_ Vehicles,Speed_ limit,Urban_ or_ Rural_ Area,Light_ Conditions is used as a new data set to build different model research data sets.
Decision_tree<- subset(accident,select=c(Accident_Severity,Number_of_Vehicles,Speed_limit,Urban_or_Rural_Area,Light_Conditions))
Decision_tree <- na.omit(Decision_tree)
skim(Decision_tree)
── Data Summary ────────────────────────
Values
Name Decision_tree
Number of rows 1780653
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None
plot_missing(Decision_tree)
#Variable type correction
Decision_tree$Accident_Severity<- factor(Decision_tree$Accident_Severity)
Decision_tree$Accident_Severity <- fct_collapse(Decision_tree$Accident_Severity,"1"=c("2"))
Decision_tree$Accident_Severity <- fct_collapse(Decision_tree$Accident_Severity,"2"=c("3"))
#Skim view after data processing
skim(Decision_tree)
── Data Summary ────────────────────────
Values
Name Decision_tree
Number of rows 1780653
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None
Decision_tree
#########################
#Split data
set.seed(35)
trains <- createDataPartition(
y= data1$Accident_Severity,
p=0.7,
list= F
)
traindata <- Decision_tree[trains,]
testdata <- Decision_tree[-trains,]
hist(traindata$Number_of_Vehicles,breaks = 50)
hist(traindata$Speed_limit,breaks = 50)
hist(traindata$Urban_or_Rural_Area,breaks = 50)
hist(traindata$Light_Conditions,breaks = 50)
##############################
#Dependent variable and independent variable construction formula
colnames(Decision_tree)
[1] "Accident_Severity" "Number_of_Vehicles" "Speed_limit" "Urban_or_Rural_Area" "Light_Conditions"
form_clsm <- as.formula(
paste0(
"Accident_Severity ~ ",
paste(colnames(traindata)[2:5],collapse = " + ")
)
)
form_clsm
Accident_Severity ~ Number_of_Vehicles + Speed_limit + Urban_or_Rural_Area +
Light_Conditions
#Training model
set.seed(35)
fit_dt_clsm<- rpart(
form_clsm,
data = traindata,
method = "anova",
parms = list(split = "information"),
control = rpart.control(cp = 0.001)
)
#Primitive regression tree
fit_dt_clsm
n= 1246458
node), split, n, deviance, yval
* denotes terminal node
1) root 1246458 157932.200 1.851134
2) Number_of_Vehicles< 1.5 376978 63694.950 1.784672
4) Light_Conditions>=2.5 126372 23409.720 1.754471 *
5) Light_Conditions< 2.5 250606 40111.840 1.799901 *
3) Number_of_Vehicles>=1.5 869480 91850.070 1.879950
6) Speed_limit>=45 226211 32508.220 1.826025
12) Speed_limit< 65 159088 25182.440 1.802832
24) Urban_or_Rural_Area>=1.5 141265 23392.530 1.790528 *
25) Urban_or_Rural_Area< 1.5 17823 1599.028 1.900353 *
13) Speed_limit>=65 67123 7037.385 1.880995 *
7) Speed_limit< 45 643269 58452.730 1.898913 *
#Complexity related data
printcp(fit_dt_clsm)
Regression tree:
rpart(formula = form_clsm, data = traindata, method = "anova",
parms = list(split = "information"), control = rpart.control(cp = 0.001))
Variables actually used in tree construction:
[1] Light_Conditions Number_of_Vehicles Speed_limit Urban_or_Rural_Area
Root node error: 157932/1246458 = 0.1267
n= 1246458
CP nsplit rel error xerror xstd
1 0.0151152 0 1.00000 1.00000 0.0017671
2 0.0056297 1 0.98488 0.98489 0.0017310
3 0.0018261 2 0.97926 0.97926 0.0017135
4 0.0012087 3 0.97743 0.97743 0.0017091
5 0.0010979 4 0.97622 0.97617 0.0017063
6 0.0010000 5 0.97512 0.97529 0.0017052
plotcp(fit_dt_clsm)
#post-pruning
fit_dt_clsm_pruned <- prune(fit_dt_clsm, cp = 0.0012)
print(fit_dt_clsm_pruned)
n= 1246458
node), split, n, deviance, yval
* denotes terminal node
1) root 1246458 157932.200 1.851134
2) Number_of_Vehicles< 1.5 376978 63694.950 1.784672 *
3) Number_of_Vehicles>=1.5 869480 91850.070 1.879950
6) Speed_limit>=45 226211 32508.220 1.826025
12) Speed_limit< 65 159088 25182.440 1.802832
24) Urban_or_Rural_Area>=1.5 141265 23392.530 1.790528 *
25) Urban_or_Rural_Area< 1.5 17823 1599.028 1.900353 *
13) Speed_limit>=65 67123 7037.385 1.880995 *
7) Speed_limit< 45 643269 58452.730 1.898913 *
#Variable importance value
fit_dt_clsm_pruned$variable.Accident_Severity
NULL
#Variable importance diagram
varimpdata <-
data.frame(importance = fit_dt_clsm_pruned$variable.importance)
ggplot(varimpdata,
aes(x = as_factor(rownames(varimpdata)), y = importance)) +
geom_col()+
labs(x = "variables")+theme(axis.test.x = element_text(angle = 15,hjust=1))+
theme_classic()
#Tree diagram
prp(fit_dt_clsm_pruned,
type = 1,
extra = 101,
fallen.leaves = TRUE,
main= "Decision Tree")
####################################
#Forecast
#Training set prediction results
trainpred <- predict(fit_dt_clsm_pruned, newdata = traindata)
#Prediction error index of training set
train <- data.frame(obs = traindata$Accident_Severity, pred = trainpred)
#Figure training set prediction results
plot(x = traindata$Accident_Severity,
y = trainpred,
xlab = "Actual",
ylab = "Predication",
main = "Compare",
sub = "Train")
trainlinmod <- lm(trainpred ~ traindata$Accident_Severity)
abline(trainlinmod, col = "blue", lwd = 2.5, lty = "solid")
abline(a = 0, b=1, col = "red", lwd = 2.5, lty = "dashed")
legend("topleft",
legend = c("Model", "Base"),
col = c("blue" , "red") ,
lwd = 2.5,
lty = c("solid", "dashed"))
#Test set prediction results
testpred <- predict(fit_dt_clsm_pruned, newdata = testdata)
#Test set prediction error index
test <- data.frame(obs = testdata$Accident_Severity, pred = testpred)
#Graphical test set prediction results
plot(x = testdata$Accident_Severity,
y = testpred,
xlab = "Actual",
ylab = "Predication",
main = "Compare",
sub = "Train")
testlinmod <- lm(testpred ~ testdata$Accident_Severity)
abline(testlinmod, col = "blue", lwd = 2.5, lty = "solid")
abline(a = 0, b=1, col = "red", lwd = 2.5, lty = "dashed")
legend("topleft",
legend = c("Model", "Base"),
col = c("blue" , "red") ,
lwd = 2.5,
lty = c("solid", "dashed"))
#Centralized display of prediction results of training set and test set
predresult <-
data.frame(bos = c(traindata$Accident_Severity, testdata$Accident_Severity),
pred = c(trainpred, testpred),
group = c(rep("Train", length(trainpred)),
rep("Test", length(testpred))))
ggplot(predresult,
aes(x = bos , y = pred, fill = group , color = group)) +
geom_point(shape = 21, size = 3)+
geom_smooth(method = "lm", se = F, size = 1.2)+
geom_abline(intercept = 0, slope = 1, size = 1.2) +
labs(fill = NULL, colour = NULL) +
theme(legend.position = "bottom")
`geom_smooth()` using formula 'y ~ x'
Decision tree is a basic classification and regression method. From the decision tree, we can find that the root node is number_ of_ Vehicles, followed by speed_ Limit, and finally urban_ or_ Rural_ Area. However, because the data are not continuous variables, we find that the model constructed by the decision tree is quite different from the actual value, and the prediction success rate is also low, so it is obvious that this model is not a wise choice.
Random_Forest <- subset(accident,select=c(Accident_Severity,Number_of_Vehicles,Speed_limit,Urban_or_Rural_Area,Light_Conditions))
Random_Forest$Accident_Severity<- factor(Random_Forest$Accident_Severity)
Random_Forest$Accident_Severity <- fct_collapse(Random_Forest$Accident_Severity,"1"=c("2"))
Random_Forest$Accident_Severity <- fct_collapse(Random_Forest$Accident_Severity,"2"=c("3"))
Random_Forest <- na.omit(Random_Forest)
#Split data
set.seed(35)
trains <- createDataPartition(
y = Random_Forest$Accident_Severity,
p = 0.7,
list = F
)
traindata <- Random_Forest[trains,]
testdata <- Random_Forest[-trains,]
#Distribution of dependent variables
table(Random_Forest$Accident_Severity)
1 2
265078 1515575
#Independent variable dependent variable construction formula
colnames(Random_Forest)
[1] "Accident_Severity" "Number_of_Vehicles" "Speed_limit" "Urban_or_Rural_Area" "Light_Conditions"
form_clsm <- as.formula(
paste0(
"Accident_Severity ~ " ,
paste(colnames(traindata)[2:5], collapse = "+")
)
)
form_clsm
Accident_Severity ~ Number_of_Vehicles + Speed_limit + Urban_or_Rural_Area +
Light_Conditions
#Build model
set.seed(35)
fit_rf_clsm <- randomForest(
form_clsm,
data = traindata,
ntree = 50,
mrty = 4,
importance = T
)
fit_rf_clsm
Call:
randomForest(formula = form_clsm, data = traindata, ntree = 50, mrty = 4, importance = T)
Type of random forest: classification
Number of trees: 50
No. of variables tried at each split: 2
OOB estimate of error rate: 14.89%
Confusion matrix:
1 2 class.error
1 7 185548 9.999623e-01
2 13 1060890 1.225371e-05
plot(fit_rf_clsm,main = "ERROR & TREES")
legend(
"top",legend = colnames(fit_rf_clsm$err.rate),
lty = 1:4,
col = 1:4,
horiz = T
)
#Variable importance
importance(fit_rf_clsm)
1 2 MeanDecreaseAccuracy MeanDecreaseGini
Number_of_Vehicles 0.8954057 6.705843 6.832288 4949.4271
Speed_limit -8.7026074 9.337598 8.815774 1893.0860
Urban_or_Rural_Area -4.8793774 6.517263 6.430765 1126.9299
Light_Conditions 2.0582818 1.669483 2.474432 847.2057
varImpPlot(fit_rf_clsm,type = 1)
varImpPlot(fit_rf_clsm,type = 2)
#Partial correlation diagram
partialPlot(x = fit_rf_clsm,
pred.data = traindata,
x.var = Number_of_Vehicles,
which.class = "1",
ylab= "1")
partialPlot(x = fit_rf_clsm,
pred.data = traindata,
x.var = Speed_limit ,
which.class = "1",
ylab= "1")
partialPlot(x = fit_rf_clsm,
pred.data = traindata,
x.var = Urban_or_Rural_Area,
which.class = "1",
ylab= "1")
partialPlot(x = fit_rf_clsm,
pred.data = traindata,
x.var = Light_Conditions,
which.class = "1",
ylab= "1")
#Forecast
#Training set prediction probability
trainpredprob <- predict(fit_rf_clsm, newdata = traindata, type = "prob")
multiclass.roc(response= traindata$Accident_Severity, predictor = trainpredprob)
Call:
multiclass.roc.default(response = traindata$Accident_Severity, predictor = trainpredprob)
Data: multivariate predictor trainpredprob with 2 levels of traindata$Accident_Severity: 1, 2.
Multi-class area under the curve: 0.5004
trainpredlab <- predict(fit_rf_clsm, newdata = traindata,type = "class")
confusionMatrix(data = trainpredlab,
reference = traindata$Accident_Severity,
mode = "everything")
Confusion Matrix and Statistics
Reference
Prediction 1 2
1 18 1
2 185537 1060902
Accuracy : 0.8511
95% CI : (0.8505, 0.8518)
No Information Rate : 0.8511
P-Value [Acc > NIR] : 0.4836
Kappa : 2e-04
Mcnemar's Test P-Value : <2e-16
Sensitivity : 9.701e-05
Specificity : 1.000e+00
Pos Pred Value : 9.474e-01
Neg Pred Value : 8.511e-01
Precision : 9.474e-01
Recall : 9.701e-05
F1 : 1.940e-04
Prevalence : 1.489e-01
Detection Rate : 1.444e-05
Detection Prevalence : 1.524e-05
Balanced Accuracy : 5.000e-01
'Positive' Class : 1
RandomForestFinal1<- data.frame(multiClassSummary(
data.frame(obs = traindata$Accident_Severity,pred = trainpredlab),
lev = levels(traindata$Accident_Severity)
))
RandomForestFinal1
#Test set prediction probability
testpredprob <- predict(fit_rf_clsm,newdata = testdata, type= "prob")
#Test set ROC
multiclass.roc(response = testdata$Accident_Severity, predictor = testpredprob)
Call:
multiclass.roc.default(response = testdata$Accident_Severity, predictor = testpredprob)
Data: multivariate predictor testpredprob with 2 levels of testdata$Accident_Severity: 1, 2.
Multi-class area under the curve: 0.5002
#Test set prediction classification
testpredlab <- predict(fit_rf_clsm, newdata = testdata, type = "class")
#Test set confusion matrix
confusionMatrix(data= testpredlab,
reference = testdata$Accident_Severity,
mode = "everything")
Confusion Matrix and Statistics
Reference
Prediction 1 2
1 4 6
2 79519 454666
Accuracy : 0.8511
95% CI : (0.8502, 0.8521)
No Information Rate : 0.8511
P-Value [Acc > NIR] : 0.504
Kappa : 1e-04
Mcnemar's Test P-Value : <2e-16
Sensitivity : 5.030e-05
Specificity : 1.000e+00
Pos Pred Value : 4.000e-01
Neg Pred Value : 8.511e-01
Precision : 4.000e-01
Recall : 5.030e-05
F1 : 1.006e-04
Prevalence : 1.489e-01
Detection Rate : 7.488e-06
Detection Prevalence : 1.872e-05
Balanced Accuracy : 5.000e-01
'Positive' Class : 1
#Test set synthesis results
RandomForestFinal2<- data.frame(multiClassSummary(
data.frame(obs = testdata$Accident_Severity, pred = testpredlab),
lev = levels(testdata$Accident_Severity)
))
RandomForestFinal2
NA
The principle of the Random_Forest model is to integrate many decision trees into a forest and use them together to predict the final result. We can see random_ The accuracy of training set and test set of forest model is about 85%, which can be regarded as a good ideal model.
SVM <- subset(accident,select=c(Accident_Severity,Number_of_Vehicles,Speed_limit,Urban_or_Rural_Area,Light_Conditions))
SVM$Accident_Severity<- factor(SVM$Accident_Severity)
SVM$Accident_Severity <- fct_collapse(SVM$Accident_Severity,"1"=c("2"))
SVM$Accident_Severity <- fct_collapse(SVM$Accident_Severity,"2"=c("3"))
SVM <- na.omit(SVM)
#Split data
set.seed(35)
trains <- createDataPartition(
y = SVM$Accident_Severity,
p = 0.02,
list = F
)
traindata <- SVM[trains,]
trains2 <- createDataPartition(
y = SVM$Accident_Severity,
p = 0.99,
list = F
)
testdata <- SVM[-trains2,]
traindata
testdata
#Distribution of dependent variables
table(SVM$Accident_Severity)
1 2
265078 1515575
#Independent variable dependent variable construction formula
colnames(SVM)
[1] "Accident_Severity" "Number_of_Vehicles" "Speed_limit" "Urban_or_Rural_Area" "Light_Conditions"
form_clsm <- as.formula(
paste0(
"Accident_Severity ~ " ,
paste(colnames(traindata)[2:5], collapse = "+")
)
)
form_clsm
Accident_Severity ~ Number_of_Vehicles + Speed_limit + Urban_or_Rural_Area +
Light_Conditions
#Build model
fit_svm_clsm <- svm(form_clsm,
data = traindata,
kernel = "radial",
cost = 1,
probability = T)
fit_svm_clsm
Call:
svm(formula = form_clsm, data = traindata, kernel = "radial", cost = 1, probability = T)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 10719
#Training set prediction probability
trainpred <- predict(fit_svm_clsm,newdata = traindata,probability = T)
trainpredprob <- attr(trainpred,"probabilities")
#Training set ROC
multiclass.roc(response = traindata$Accident_Severity,predictor = trainpredprob)
Call:
multiclass.roc.default(response = traindata$Accident_Severity, predictor = trainpredprob)
Data: multivariate predictor trainpredprob with 2 levels of traindata$Accident_Severity: 2, 1.
Multi-class area under the curve: 0.444
#Training set confusion matrix
confusionMatrix(data = trainpred,
reference = traindata$Accident_Severity,
mode = "everything")
Confusion Matrix and Statistics
Reference
Prediction 1 2
1 0 0
2 5302 30312
Accuracy : 0.8511
95% CI : (0.8474, 0.8548)
No Information Rate : 0.8511
P-Value [Acc > NIR] : 0.5037
Kappa : 0
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.0000
Specificity : 1.0000
Pos Pred Value : NaN
Neg Pred Value : 0.8511
Precision : NA
Recall : 0.0000
F1 : NA
Prevalence : 0.1489
Detection Rate : 0.0000
Detection Prevalence : 0.0000
Balanced Accuracy : 0.5000
'Positive' Class : 1
#Training set synthesis results
SVMFinal1 <- data.frame(multiClassSummary(
data.frame(obs = traindata$Accident_Severity,pred = trainpred),
lev = levels(traindata$Accident_Severity)
))
SVMFinal1
#Test set prediction probability
testpred <- predict(fit_svm_clsm,newdata = testdata,probability = T)
testpredprob <- attr(testpred,"probabilities")
#Test set ROC
multiclass.roc(response = testdata$Accident_Severity,predictor = testpredprob)
Call:
multiclass.roc.default(response = testdata$Accident_Severity, predictor = testpredprob)
Data: multivariate predictor testpredprob with 2 levels of testdata$Accident_Severity: 2, 1.
Multi-class area under the curve: 0.4315
#Test set confusion matrix
confusionMatrix(data = testpred,
reference = testdata$Accident_Severity,
mode = "everything")
Confusion Matrix and Statistics
Reference
Prediction 1 2
1 0 0
2 2650 15155
Accuracy : 0.8512
95% CI : (0.8459, 0.8564)
No Information Rate : 0.8512
P-Value [Acc > NIR] : 0.5052
Kappa : 0
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.0000
Specificity : 1.0000
Pos Pred Value : NaN
Neg Pred Value : 0.8512
Precision : NA
Recall : 0.0000
F1 : NA
Prevalence : 0.1488
Detection Rate : 0.0000
Detection Prevalence : 0.0000
Balanced Accuracy : 0.5000
'Positive' Class : 1
#Test set synthesis results
SVMFinal2 <- data.frame(multiClassSummary(
data.frame(obs = testdata$Accident_Severity,pred = testpred),
lev = levels(testdata$Accident_Severity)
))
SVMFinal2
Support vector machines (SVM) is a binary classification model. Its basic model is the linear classifier with the largest interval defined in the feature space. The learning algorithm of SVM is the optimization algorithm for solving convex quadratic programming. We can find that the prediction probability of training set and test set is more than 85%, which is an ideal model.
In this section, we will use hierarchical clustering and kMeans clustering to do the clustering task. Background: there are 51 police organizations in Britain which are in charge of different countries.When an accident occurs, the nearest police organization will handle that accident. We want to use clustering to help police organizations own more reasonable manpower management.The idea is that if a police organization always handles serious accidents, more police constables should be allocated to it.
accident <- na.omit(accident)
data_clustering_orignial <- accident
#convert indicator back into categorical variables, which is more readable.
data_levels<-c(1,3:7,10:14,16,17,20:23,30:37,40:48,50,52:55,60:63,91:98)
data_labels<-c("Metropolitan Police","Cumbria","Lancashire","Merseyside","Greater Manchester","Cheshire","Northumbria","Durham","North Yorkshire","West Yorkshire","South Yorkshire","Humberside","Cleveland","West Midlands","Staffordshire","West Mercia","Warwickshire","Derbyshire","Nottinghamshire","Lincolnshire","Leicestershire","Northamptonshire","Cambridgeshire","Norfolk","Suffolk","Bedfordshire","Hertfordshire","Essex","Thames Valley","Hampshire","Surrey","Kent","Sussex","City of London","Devon and Cornwall","Avon and Somerset","Gloucestershire","Wiltshire","Dorset","North Wales","Gwent","South Wales","Dyfed-Powys","Northern","Grampian","Tayside","Fife","Lothian and Borders","Central","Strathclyde","Dumfries and Galloway")
data_clustering_orignial$Police_Force<-factor(data_clustering_orignial$Police_Force,levels=data_levels,labels=data_labels)
data_clustering_orignial$Police_Force<-as.character(data_clustering_orignial$Police_Force)
#To measure the severity of an accident, number of casualties and vehicles should be considered. What's more, sometimes there could be a fatal accident in which only one casualty, so the severity from the data set should be included here as well.
#It is noted that the larger value of Mean_Accident_Severity a police force has, the severity is lighter.
data_clustering_orignial<- data_clustering_orignial%>% select(Police_Force,Number_of_Vehicles,Number_of_Casualties,Accident_Severity)
data_clustering_groupby_PF <- data_clustering_orignial %>% group_by(Police_Force) %>% summarise(Mean_Number_of_Vehicles = mean(Number_of_Vehicles),Mean_Number_of_Casualties = mean(Number_of_Casualties),Mean_Accident_Severity = mean(Accident_Severity))
data_clustering <- data_clustering_groupby_PF %>% select(-Police_Force)
#We do scaling here to make all the variable be zero-mean and standard variation.
scale(data_clustering)
Mean_Number_of_Vehicles Mean_Number_of_Casualties Mean_Accident_Severity
[1,] 0.91710415 -0.498056140 0.84684321
[2,] 0.93212837 0.105332383 0.60815623
[3,] 0.43870243 -0.299348864 0.20040155
[4,] -1.15248973 -0.733786347 -1.22041368
[5,] 1.15626375 0.663709963 -0.21016392
[6,] -1.95070810 -3.559402386 0.71282017
[7,] 0.20051410 0.718373132 0.07698382
[8,] -0.48962931 0.411042088 0.11171513
[9,] 0.83529724 0.116144993 0.28420487
[10,] -0.28477635 -0.005128833 1.11317007
[11,] 0.86031052 -0.464312815 0.01491648
[12,] -2.38609513 -0.261060762 -2.22263949
[13,] 0.03362173 1.560593705 0.30348931
[14,] 0.68507370 1.341546448 -0.93746027
[15,] 0.89030912 -0.717344208 -0.22250495
[16,] -0.98829247 -0.872476384 -1.08922701
[17,] 0.44990901 -0.268242448 -0.36802903
[18,] -2.38912681 -1.509355561 -2.75215315
[19,] 0.52677972 0.457342481 1.01778211
[20,] -0.02549425 0.908523297 0.08636330
[21,] 0.24706658 -1.513505113 -0.32904710
[22,] 1.47047422 0.579123704 0.83944505
[23,] -0.05567267 0.871208163 -0.33831665
[24,] 0.57277911 -0.239819299 1.13557018
[25,] 0.03217347 1.133957132 -0.19490287
[26,] 0.80748151 -0.402025230 1.03232098
[27,] -0.82678892 0.750415580 0.26146740
[28,] -1.46996909 -1.457937243 0.33535722
[29,] 0.32051120 1.639541444 0.32485340
[30,] -0.36937899 -2.444657105 1.33265563
[31,] 0.19704424 0.165605384 -0.33682324
[32,] -0.26951179 1.119482436 -0.25840952
[33,] 0.16053256 0.855202224 -1.63312522
[34,] 0.84828763 -0.486637092 -1.68467660
[35,] -2.95993126 1.151300851 -1.44846223
[36,] -0.11463141 0.142580774 1.02532055
[37,] 0.14570649 -0.274019131 -0.03004011
[38,] 0.50478618 0.994715970 1.28910434
[39,] 0.12317579 0.723193526 0.92938351
[40,] 1.03057793 0.225119073 1.98412156
[41,] -1.33863075 -1.055475783 0.07777651
[42,] 0.34928445 -0.137869924 0.32349859
[43,] 1.14495702 0.564396001 0.96193072
[44,] 0.28923024 -0.643859146 -0.64527286
[45,] -1.78259924 -1.192988661 -2.45545326
[46,] 0.76612151 0.021409321 0.42379502
[47,] 0.56430466 0.158438139 -0.33607335
[48,] -0.07393611 0.295460092 0.17775182
[49,] 0.40371160 0.056627315 0.84476280
[50,] 0.09508186 1.094708994 0.46104876
[51,] 0.92836031 0.212213863 -0.42381577
attr(,"scaled:center")
Mean_Number_of_Vehicles Mean_Number_of_Casualties Mean_Accident_Severity
1.817966 1.367618 2.819472
attr(,"scaled:scale")
Mean_Number_of_Vehicles Mean_Number_of_Casualties Mean_Accident_Severity
0.08326333 0.06865388 0.04260056
#we calculate the euclidean distance of each data point in the data_clustering
d <- dist(data_clustering, method="euclidean")
#Here we use Ward's method to do grouping. The idea of this algorithm is to minimize the variance within a cluster. When finishing grouping, we use plot function to display our cluster dendrogram. As we want to divide the data into 6 groups, I use function rect.hclust, which will draw k=4 rectangles on the dendrogram to show the grouping.
pfit <- hclust(d, method="ward.D2")
par(mar = c(0, 0, 0, 0))
plot(pfit,labels=data_clustering_groupby_PF$Police_Force,main="Cluster Dendrogram")
rect.hclust(pfit, k=6)
#Here we use cutree function to return the cluster each data point belong to. Here we want 6 groups.
groups <- cutree(pfit, k=6)
#this function is used to combine the columns we want from the data frame and the cluster it belong to and create a new data frame.
print_clusters <- function(df, groups, cols_to_print) {
Ngroups <- max(groups)
for (i in 1:Ngroups) {
print(paste("cluster", i))
print(df[groups == i, cols_to_print])
}
}
cols_to_print <- c("Police_Force","Mean_Number_of_Casualties","Mean_Number_of_Vehicles","Mean_Accident_Severity")
print_clusters(data_clustering_groupby_PF, groups, cols_to_print)
[1] "cluster 1"
[1] "cluster 2"
[1] "cluster 3"
[1] "cluster 4"
[1] "cluster 5"
[1] "cluster 6"
VIsualising
#To visualize the cluster, we need to project our data point onto a 2D plane, for which we need to use a dimension reduction method called Principal Component Analysis. This method will calculate the principal components which can reserve the variation and sort them in a descending order using the degree of the reservation. Here we use the first 2 principal components to map our datapoint to a 2D plane.
princ <- prcomp(data_clustering)
nComp <-2
project2D <- as.data.frame(predict(princ, newdata=data_clustering)[,1:nComp])
hclust.project2D <- cbind(project2D, cluster=as.factor(groups), Police_Force=data_clustering_groupby_PF$Police_Force)
head(hclust.project2D)
library('grDevices')
#This function is used to draw a convex hull using a group of data point. proj2Ddf is the datapoint which is now in 2D, groups is a vector of the cluster each point belong to.
find_convex_hull <- function(proj2Ddf, groups) {
do.call(rbind,lapply(unique(groups),
FUN = function(c) {
f <- subset(proj2Ddf, cluster==c);f[chull(f),]
}
)
)
}
hclust.hull <- find_convex_hull(hclust.project2D, groups)
library(ggplot2)
#Here we use ggplot to draw the convex hull and place those data points in it. What's more, to show those data points, the convex hull will be translunce and the alpha is 0.4.
ggplot(hclust.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_text(aes(label=c(1,nrow(princ)), color=cluster), hjust=0, vjust=1, size=3) +
geom_polygon(data=hclust.hull, aes(group=cluster, fill=as.factor(cluster)),
alpha=0.4, linetype=0) + theme(text=element_text(size=20))
#We use clusterboot to check the stability of the cluster. The idea of it is that it will resample from the data frame and form some new clusters and see the similarity of the new clusters and the old ones. If the similarity is high, it means the cluster is highly stable.
#Here, we input our dataframe data_clustering, tell the function the grouping method we use is Ward's and the function need to create 6 cluster in each iteration. It will run 100 iterations and the result will be recorded in cboot.hclust. When getting the result, we do a summary.
library(fpc)
kbest.p <- 6
cboot.hclust <- clusterboot(data_clustering, clustermethod=hclustCBI,method="ward.D2", k=kbest.p)
boot 1
boot 2
boot 3
boot 4
boot 5
boot 6
boot 7
boot 8
boot 9
boot 10
boot 11
boot 12
boot 13
boot 14
boot 15
boot 16
boot 17
boot 18
boot 19
boot 20
boot 21
boot 22
boot 23
boot 24
boot 25
boot 26
boot 27
boot 28
boot 29
boot 30
boot 31
boot 32
boot 33
boot 34
boot 35
boot 36
boot 37
boot 38
boot 39
boot 40
boot 41
boot 42
boot 43
boot 44
boot 45
boot 46
boot 47
boot 48
boot 49
boot 50
boot 51
boot 52
boot 53
boot 54
boot 55
boot 56
boot 57
boot 58
boot 59
boot 60
boot 61
boot 62
boot 63
boot 64
boot 65
boot 66
boot 67
boot 68
boot 69
boot 70
boot 71
boot 72
boot 73
boot 74
boot 75
boot 76
boot 77
boot 78
boot 79
boot 80
boot 81
boot 82
boot 83
boot 84
boot 85
boot 86
boot 87
boot 88
boot 89
boot 90
boot 91
boot 92
boot 93
boot 94
boot 95
boot 96
boot 97
boot 98
boot 99
boot 100
summary(cboot.hclust$result)
Length Class Mode
result 7 hclust list
noise 1 -none- logical
nc 1 -none- numeric
clusterlist 6 -none- list
partition 51 -none- numeric
clustermethod 1 -none- character
nccl 1 -none- numeric
#Calculate the stability of each cluster.
1 - cboot.hclust$bootbrd/100
[1] 0.94 0.82 0.75 0.63 0.95 0.81
We can see that cluster 1,5 are highly stable, 3,6 are stable, and cluster 4 is not good enough which can seen on the cluster visualization as well.
#Function for finding the number of good cluster
# This function will return the distance square of two data point.
sqr_euDist <- function(x, y) {
sum((x - y)^2)
}
# Function to calculate WSS of a cluster, represented as a n-by-d matrix
# (where n and d are the numbers of rows and columns of the matrix)
# which contains only points of the cluster.
wss <- function(clustermat) {
c0 <- colMeans(clustermat)
sum(apply( clustermat, 1, FUN=function(row) {sqr_euDist(row, c0)} ))
}
# Function to calculate the total WSS. Argument `scaled_df`: data frame
# with normalised numerical columns. Argument `labels`: vector containing
# the cluster ID (starting at 1) for each row of the data frame.
wss_total <- function(scaled_df, labels) {
wss.sum <- 0
k <- length(unique(labels))
for (i in 1:k)
wss.sum <- wss.sum + wss(subset(scaled_df, labels == i))
wss.sum
}
# Function to calculate total sum of squared (TSS) distance of data
# points about the (global) mean. This is the same as WSS when the
# number of clusters (k) is 1.
tss <- function(scaled_df) {
wss(scaled_df)
}
# Function to return the CH indices computed using hierarchical
# clustering (function `hclust`) or k-means clustering (`kmeans`)
# for a vector of k values ranging from 1 to kmax.
CH_index <- function(scaled_df, kmax, method="kmeans") {
if (!(method %in% c("kmeans", "hclust")))
stop("method must be one of c('kmeans', 'hclust')")
npts <- nrow(scaled_df)
wss.value <- numeric(kmax) # create a vector of numeric type
# wss.value[1] stores the WSS value for k=1 (when all the
# data points form 1 large cluster).
wss.value[1] <- wss(scaled_df)
if (method == "kmeans") {
# kmeans
for (k in 2:kmax) {
clustering <- kmeans(scaled_df, k, nstart=10, iter.max=100)
wss.value[k] <- clustering$tot.withinss
}
}
else {
# hclust
d <- dist(scaled_df, method="euclidean")
pfit <- hclust(d, method="ward.D2")
for (k in 2:kmax) {
labels <- cutree(pfit, k=k)
wss.value[k] <- wss_total(scaled_df, labels)
}
}
bss.value <- tss(scaled_df) - wss.value # this is a vector
B <- bss.value / (0:(kmax-1)) # also a vector
W <- wss.value / (npts - 1:kmax) # also a vector
data.frame(k = 1:kmax, CH_index = B/W, WSS = wss.value)
}
library(gridExtra)
# calculate the CH criterion
crit.df <- CH_index(data_clustering, 10, method="hclust")
fig1 <- ggplot(crit.df, aes(x=k, y=CH_index)) +
geom_point() + geom_line(colour="red") +
scale_x_continuous(breaks=1:10, labels=1:10) +
labs(y="CH index") + theme(text=element_text(size=20))
fig2 <- ggplot(crit.df, aes(x=k, y=WSS), color="blue") +
geom_point() + geom_line(colour="blue") +
scale_x_continuous(breaks=1:10, labels=1:10) +
theme(text=element_text(size=20))
grid.arrange(fig1, fig2, nrow=1)
Warning: Removed 1 rows containing missing values (geom_point).
Warning: Removed 1 row(s) containing missing values (geom_path).
We can see that CH criterion is maximized at k = 2,with another local maximum at k = 8 Aftering trying k=8, we found there would be one cluster is ustable, for which we use k=2 here.
pfit <- hclust(d, method="ward.D2")
plot(pfit, labels=data_clustering_groupby_PF$Police_Force, main="Cluster Dendrogram")
rect.hclust(pfit, k=2)
groups_hierarcy <- cutree(pfit, k=2)
cols_to_print <- c("Police_Force","Mean_Number_of_Casualties","Mean_Number_of_Vehicles","Mean_Accident_Severity")
print_clusters(data_clustering_groupby_PF, groups_hierarcy, cols_to_print)
[1] "cluster 1"
[1] "cluster 2"
kbest.p <- 2
cboot.hclust <- clusterboot(data_clustering, clustermethod=hclustCBI,method="ward.D2", k=kbest.p)
boot 1
boot 2
boot 3
boot 4
boot 5
boot 6
boot 7
boot 8
boot 9
boot 10
boot 11
boot 12
boot 13
boot 14
boot 15
boot 16
boot 17
boot 18
boot 19
boot 20
boot 21
boot 22
boot 23
boot 24
boot 25
boot 26
boot 27
boot 28
boot 29
boot 30
boot 31
boot 32
boot 33
boot 34
boot 35
boot 36
boot 37
boot 38
boot 39
boot 40
boot 41
boot 42
boot 43
boot 44
boot 45
boot 46
boot 47
boot 48
boot 49
boot 50
boot 51
boot 52
boot 53
boot 54
boot 55
boot 56
boot 57
boot 58
boot 59
boot 60
boot 61
boot 62
boot 63
boot 64
boot 65
boot 66
boot 67
boot 68
boot 69
boot 70
boot 71
boot 72
boot 73
boot 74
boot 75
boot 76
boot 77
boot 78
boot 79
boot 80
boot 81
boot 82
boot 83
boot 84
boot 85
boot 86
boot 87
boot 88
boot 89
boot 90
boot 91
boot 92
boot 93
boot 94
boot 95
boot 96
boot 97
boot 98
boot 99
boot 100
summary(cboot.hclust$result)
Length Class Mode
result 7 hclust list
noise 1 -none- logical
nc 1 -none- numeric
clusterlist 2 -none- list
partition 51 -none- numeric
clustermethod 1 -none- character
nccl 1 -none- numeric
1 - cboot.hclust$bootbrd/100
[1] 0.99 0.94
We can see that, these two clusters are highly stable.
Method 2: Kmeans We use Kmeans to verify whether 2 cluster is reasonable.
#We start by forming 5 clusters, 100 starts, and 100 iterations.
kbest.p <- 5
kmClusters <- kmeans(data_clustering, kbest.p, nstart=100, iter.max=100)
kmClusters$centers
Mean_Number_of_Vehicles Mean_Number_of_Casualties Mean_Accident_Severity
1 1.818947 1.423765 2.825298
2 1.691390 1.289558 2.769032
3 1.877970 1.359304 2.831082
4 1.721377 1.161517 2.863041
5 1.595402 1.398177 2.741277
groups <- kmClusters$cluster
print_clusters(data_clustering_groupby_PF, groups, cols_to_print )
[1] "cluster 1"
[1] "cluster 2"
[1] "cluster 3"
[1] "cluster 4"
[1] "cluster 5"
kmClustering.ch <- kmeansruns(data_clustering, krange=1:10, criterion="ch")
kmClustering.ch$bestk
[1] 1
kmClustering.asw <- kmeansruns(data_clustering, krange=1:10, criterion="asw")
kmClustering.asw$bestk
[1] 1
hclusting <- CH_index(data_clustering, 10, method="hclust")
print(hclusting$CH_index)
[1] NaN 51.15214 39.32010 38.90136 35.89318 36.34029 36.39566 37.24519 37.14632 36.19086
library(gridExtra)
kmCritframe <- data.frame(k=1:10, ch=kmClustering.ch$crit,
asw=kmClustering.asw$crit)
fig1 <- ggplot(kmCritframe, aes(x=k, y=ch)) +
geom_point() + geom_line(colour="red") +
scale_x_continuous(breaks=1:10, labels=1:10) +
labs(y="CH index") + theme(text=element_text(size=20))
fig2 <- ggplot(kmCritframe, aes(x=k, y=asw)) +
geom_point() + geom_line(colour="blue") +
scale_x_continuous(breaks=1:10, labels=1:10) +
labs(y="ASW") + theme(text=element_text(size=20))
grid.arrange(fig1, fig2, nrow=1)
2 is best
fig <- c()
kvalues <- seq(2,5)
for (k in kvalues) {
groups <- kmeans(data_clustering, k, nstart=100, iter.max=100)$cluster
kmclust.project2D <- cbind(project2D, cluster=as.factor(groups),Police_Force=data_clustering_groupby_PF$Police_Force)
kmclust.hull <- find_convex_hull(kmclust.project2D, groups)
assign(paste0("fig", k),
ggplot(kmclust.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_polygon(data=kmclust.hull, aes(group=cluster, fill=cluster),
alpha=0.4, linetype=0) +
labs(title = sprintf("k = %d", k)) +
theme(legend.position="none", text=element_text(size=20))
)
}
library(gridExtra)
grid.arrange(fig2, fig3, fig4, fig5, nrow=2)
kbest.p <- 2
kmClusters <- kmeans(data_clustering, kbest.p, nstart=100, iter.max=100)
groups_kmeans <- kmClusters$cluster
print_clusters(data_clustering_groupby_PF, groups_kmeans, cols_to_print )
[1] "cluster 1"
[1] "cluster 2"
After using two method, we can see that 2 is the best number of cluster. In these two clusers, we can see that one cluster owns less points but more serious accidents(called cluster A)while the other one owns more points and slight accidents (named cluster B). Therefore, we can do some staff movement from the cluster B to cluster A. For example, transfer some police constables from Devon and Cornwall to Tayside. For another, we can see that the mean number of casualties of data points in cluster A is generally lower than that in cluster B.It reveals that passengers are not a main cause for the accidents. The police organization in the cluster A should educate the public to drive safely and contact the relevant departments to fix some flaws on the roads.
After detailed data cleaning, data analysis, modeling, and clustering, we filtered out the first four categories of data that most affected the outcome of an accident through the decision tree, and found that Random_Forest or SVM was the ideal predictive model. After clustering, After using two methods, we can see that 2 is the best number of the cluster.